Skip to content

FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526

Open
ffelixg wants to merge 16 commits into
microsoft:mainfrom
ffelixg:simdutf
Open

FIX: [PERF] Handle utf16 strings using simdutf and std::u16string#526
ffelixg wants to merge 16 commits into
microsoft:mainfrom
ffelixg:simdutf

Conversation

@ffelixg
Copy link
Copy Markdown
Contributor

@ffelixg ffelixg commented Apr 15, 2026

Work Item / Issue Reference

GitHub Issue: #514


Summary

  • Use external dependency simdutf to decode utf16 when decoding in c++
  • Use pybind11 support for std::u16string to decode utf16 when interacting with python
  • Remove SQLWCHARToWString, WideToUTF8, Utf8ToWString and WStringToSQLWCHAR; refactor call sites

Copilot AI review requested due to automatic review settings April 15, 2026 23:50
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 15, 2026

I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit.

This brings arrow performance on linux for nvarchars in line with what it should be.

If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace WStringToSQLWCHAR

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).

Changes:

  • Add simdutf (via find_package or FetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths.
  • Replace a number of std::wstring usages with std::u16string for connection strings, queries, and SQL_C_WCHAR parameter buffers.
  • Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
mssql_python/pybind/CMakeLists.txt Adds simdutf dependency resolution and links it into the ddbc_bindings module.
mssql_python/pybind/ddbc_bindings.h Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string.
mssql_python/pybind/ddbc_bindings.cpp Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf.
mssql_python/pybind/connection/connection.h Updates connection APIs/state to store connection strings as std::u16string.
mssql_python/pybind/connection/connection.cpp Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly.
mssql_python/pybind/connection/connection_pool.h Updates pooling key types and APIs to use std::u16string.
mssql_python/pybind/connection/connection_pool.cpp Implements pooling with std::u16string connection string keys.
mssql_python/pybind/unix_utils.h Removes the SQLWCHARToWString declaration.
mssql_python/pybind/unix_utils.cpp Removes the SQLWCHARToWString implementation.
Comments suppressed due to low confidence (1)

mssql_python/pybind/ddbc_bindings.cpp:583

  • In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses data() + SQL_NTS and sets bufferLength to size() * sizeof(SQLWCHAR). For SQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (prefer c_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.

Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.

                    std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
                        paramBuffers, param.cast<std::u16string>());
                    LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
                        "length=%zu characters, buffer=%zu bytes",
                        paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
                    dataPtr = sqlwcharBuffer->data();
                    bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
                    strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
                    *strLenOrIndPtr = SQL_NTS;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 16, 2026

📊 Code Coverage Report

🔥 Diff Coverage

94%


🎯 Overall Coverage

26%


📈 Total Lines Covered: 7085 out of 26978
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (94.3%): Missing lines 1546,1621,1638,3753,3762,4095,4197
  • mssql_python/pybind/utf_utils.h (92.3%): Missing lines 15-16

Summary

  • Total: 154 lines
  • Missing: 9 lines
  • Coverage: 94%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1542-1550

  1542     LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
  1543     ErrorInfo errorInfo;
  1544     if (retcode == SQL_INVALID_HANDLE) {
  1545         LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1546         errorInfo.ddbcErrorMsg = "Invalid handle!";
  1547         return errorInfo;
  1548     }
  1549     assert(handle != 0);
  1550     SQLHANDLE rawHandle = handle->get();

Lines 1617-1625

  1617     return records;
  1618 }
  1619 
  1620 // Wrap SQLExecDirect
! 1621 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
  1622     LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
  1623         "query_length=%zu chars",
  1624         (void*)StatementHandle->get(), Query.length());
  1625     if (!SQLExecDirect_ptr) {

Lines 1634-1642

  1634         SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
  1635                            (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
  1636     }
  1637 
! 1638     SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
  1639     SQLRETURN ret;
  1640     {
  1641         // Release the GIL during the blocking ODBC call so that other Python
  1642         // threads (e.g. asyncio event loop, heartbeat threads) can run while

Lines 3749-3757

  3749                                      sizeof(DateTimeOffset) * fetchSize,
  3750                                      buffers.indicators[col - 1].data());
  3751                 break;
  3752             default:
! 3753                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3754                 std::ostringstream errorString;
  3755                 errorString << "Unsupported data type for column - " << columnName.c_str()
  3756                             << ", Type - " << dataType << ", column ID - " << col;
  3757                 LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 3758-3766

  3758                 ThrowStdException(errorString.str());
  3759                 break;
  3760         }
  3761         if (!SQL_SUCCEEDED(ret)) {
! 3762             std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3763             std::ostringstream errorString;
  3764             errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
  3765                         << dataType << ", column ID - " << col;
  3766             LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 4091-4099

  4091                     break;
  4092                 }
  4093                 default: {
  4094                     const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 4095                     std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4096                     std::ostringstream errorString;
  4097                     errorString << "Unsupported data type for column - " << columnName.c_str()
  4098                                 << ", Type - " << dataType << ", column ID - " << col;
  4099                     LOG("FetchBatchData: %s", errorString.str().c_str());

Lines 4193-4201

  4193             case SQL_SS_TIMESTAMPOFFSET:
  4194                 rowSize += sizeof(DateTimeOffset);
  4195                 break;
  4196             default:
! 4197                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4198                 std::ostringstream errorString;
  4199                 errorString << "Unsupported data type for column - " << columnName.c_str()
  4200                             << ", Type - " << dataType << ", column ID - " << col;
  4201                 LOG("calculateRowSize: %s", errorString.str().c_str());

mssql_python/pybind/utf_utils.h

  11 #include <string>
  12 
  13 inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) {
  14     if (utf16.empty()) {
! 15         return {};
! 16     }
  17 
  18     simdutf::result utf8Length =
  19         simdutf::utf8_length_from_utf16le_with_replacement(utf16.data(), utf16.size());
  20     std::string utf8(utf8Length.count, '\0');


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 61.1%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 16, 2026

Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead.

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@ffelixg ffelixg changed the title PERF: Use simdutf for decoding utf16 PERF: Handle utf16 strings using simdutf and std::u16string Apr 19, 2026
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 19, 2026

@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope.

I have also replaced the remaining std::wstring occurences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path.

@gargsaumya
Copy link
Copy Markdown
Contributor

Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. ButI’ll still take a closer look at the changes and get back to you.
I’ve approved the workflow for now so we can run the pipeline and get a passing test run.

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed
@gargsaumya gargsaumya changed the title PERF: Handle utf16 strings using simdutf and std::u16string FIX: [PERF] Handle utf16 strings using simdutf and std::u16string Apr 20, 2026
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 20, 2026

I think that this latest CI failure was due to a flaky test. The same test / platform passed in a past CI run and I don't think my changes affected that test. Could you rerun that one @gargsaumya?

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@gargsaumya gargsaumya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate the effort here, this is a very meaningful contribution @ffelixg. I've reviewed the PR and left 2 inline comments, please take a look!

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread mssql_python/pybind/CMakeLists.txt
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 21, 2026

Glad you like it @gargsaumya! I totally agree with both your comments and pushed changes accordingly.

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 22, 2026

Seems like the same flaky test failed again

gargsaumya
gargsaumya previously approved these changes Apr 23, 2026
@jahnvi480
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sumitmsft
Copy link
Copy Markdown
Contributor

I have a few generic comments before I delve into core review:
@gargsaumya

  • Add a short note to CONTRIBUTING.md or a pybind/README.md: "Use std::u16string for all SQLWCHAR data; use simdutf for any UTF transcoding. Do not introduce std::wstring in the C++ bindings."
  • The three inline helpers (utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, reinterpretU16stringAsSqlWChar) are in ddbc_bindings.h. Would recommend to them to a dedicated utf_utils.h so they can be unit-tested in isolation.
  • We need to subscribe to simdutf releases. They have aggressive release timelines. We would want to evaluate before bumping its versions.

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
result.push_back(static_cast<wchar_t>(UNICODE_REPLACEMENT_CHAR));
}
}
inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) {
Copy link
Copy Markdown
Contributor

@sumitmsft sumitmsft May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would recommend to move utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, and reinterpretU16stringAsSqlWChar from ddbc_bindings.h into a small utf_utils.h so they can be unit tested and the changes don't force recompilation of the whole module.

I do not see dedicated unit tests for utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, or reinterpretU16stringAsSqlWChar. Edge cases (empty strings, surrogate pairs, lone surrogates) are only tested indirectly through integration tests. Once moved to utf_utils.h I would suggest to add targeted unit tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By targeted unit tests, am I correct in assuming that you mean pure c++ unit tests? Testing them via python wrappers and pytest feels wrong. It would also involve changes to the build process as well as CI pipelines, since there is currently no infrastructure for c++ tests. The predecessors of these functions were also only tested via integration or even fully inlined as was the case for some reinterpretU16stringAsSqlWChar applications.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I agree that setting up a full C++ unit test framework and CI pipeline changes is out of scope for this PR. Let's not block this PR on it, but I'd like to open a tracking issue for it in our internal backlog for future enhancements.

I see you've already added utf_utils.h, thank you! Let's keep moving with this PR. We are expecting one more review on this before we merge it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gargsaumya please track this: "but I'd like to open a tracking issue for it in our internal backlog for future enhancements" in ADO.

@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented May 10, 2026

Thanks for the review @sumitmsft.

Regarding the first point, There's actually one more method that's currently being used to convert unicode aside from simdutf and std::u16string, which is the native python c api functions in ProcessWChar (fetchall/fetchmany). I changed your suggestion accordingly and put it into the pybind README. Not sure if you meant Saumya to do the documentation, since you tagged her, feel free to change it in any way @gargsaumya.

I hope the added dependency is not too much of an inconvenience. I looked into native iconv as well, but didn't find it appealing in comparison. The simdutf releases look to me like they are mostly about adding new features, which we likely won't care about. Since unicode conversion itself is a solved problem, I was hoping that we would technically be fine using the same version forever, assuming no crazy bugs.

@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread mssql_python/pybind/utf_utils.h Dismissed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants